Baysian Classification and Naive Bayes for genre classification using lyrics

In this notebook we look at the problem of classifying songs to three genres (rap, rock and country) based on a simple binary bag of words representation. First we load the data and then we take a look at it. Using our implementation of discrete random variables we generate new random songs. Finally we show how classification can be performed using Bayes Rule. The data comes for the lyrics of songs from the Million Song DataSet and was created for an assignment in my course on MIR.

Data layout and visualization

The data layout and the way the classifier is implemented is not general and not optimal but done for pedagogical purposes. Each genre consists of 1000 tracks and the matrix containing the data is ordered by genre. That way the instances corresponding to a genre can easily be found by the index without having to check the class label as would be the case with a general classifier.

We have created a dictionary of 30 words by taking the 10 "best" words based on tf-idf score for each genre. Each track is represented by a binary vector (a row in the matrix) with ones for dictionary words that are in the track and 0 for words that are not. So the matrix is 3000 instances (3 * 1000 per genre) by 30 for each word in the dictionary. When visualized one can observe the block structure that shows that the the rap tracks have a lot of words from the first 10 words in the dictionary that are characteristic of rap.


In [4]:
%matplotlib inline 
import matplotlib.pyplot as plt
import pickle 
import numpy as np


# load some lyrics bag of words data, binarize, separate matrix rows by genre 
data = np.load('data/data.npz')
a = data['arr_0']
a[a > 0] = 1
labels = np.load('data/labels.npz')
labels = labels['arr_0']
dictionary = pickle.load(open('data/dictionary.pck','rb'), encoding='latin1')
word_indices = [  41, 1465,  169,  217, 1036,  188,  260,  454,  173,  728,  163,
        151,  107,  142,   90,  141,  161,  131,   86,   73,  165,  133,
         84,  244,  153,  126,  137,  119,   80,  224]
words = [dictionary[r] for r in word_indices]

# binary row vectors separate by genre (rap, rock, country)
ra_rows = a[0:1000,:]
ro_rows = a[1000:2000,:]
co_rows = a[2000:3000,:] 
print(ra_rows.shape, ro_rows.shape, co_rows.shape)
plt.imshow(a, aspect='auto', cmap='gray')


(1000, 30) (1000, 30) (1000, 30)
Out[4]:
<matplotlib.image.AxesImage at 0x106490d30>

Calculating the 30-dimensional word probability vector for each genre

Let's calculate the word probability vector for each genre and then look at the most probable words for each genre in our data as well as how particular songs are represented as bag of words. We can calculate the probabilities of each word in the dictionary for the songs in each genre by summing the columns of the part of the matrix that corrsponds to each genre. As some words might not appear at all I have added 1.0 to both the numerator and denominator. This is a simple form of what's called additive smoothing which is a common technique to avoid zeros for any class conditional probabilities that would lead to the whole likelihood being zero.


In [41]:
# calculate word counts for each genre 
word_probs_ra = (ra_rows.sum(axis=0).astype(float) + 1.0) / (len(ra_rows)+1.0)
word_probs_ro = (ro_rows.sum(axis=0).astype(float) + 1.0) / (len(ro_rows)+1.0)
word_probs_co = (co_rows.sum(axis=0).astype(float) + 1.0) / (len(co_rows)+1.0)

# Let's llok at the word probabitites for rap music 
for w in zip(word_probs_ra, words): 
    print(w)


(0.087912087912087919, 'de')
(0.18581418581418582, 'niggaz')
(0.43956043956043955, 'ya')
(0.062937062937062943, 'und')
(0.28271728271728269, 'yall')
(0.057942057942057944, 'ich')
(0.41258741258741261, 'fuck')
(0.50849150849150848, 'shit')
(0.41158841158841158, 'yo')
(0.3126873126873127, 'bitch')
(0.17982017982017981, 'end')
(0.11688311688311688, 'wait')
(0.17182817182817184, 'again')
(0.1968031968031968, 'light')
(0.23276723276723277, 'eye')
(0.12087912087912088, 'noth')
(0.11188811188811189, 'lie')
(0.14185814185814186, 'fall')
(0.21478521478521478, 'our')
(0.16283716283716285, 'away')
(0.17382617382617382, 'gone')
(0.26973026973026976, 'good')
(0.22477522477522477, 'night')
(0.095904095904095904, 'blue')
(0.18981018981018982, 'home')
(0.18381618381618381, 'long')
(0.24175824175824176, 'littl')
(0.21378621378621379, 'well')
(0.16483516483516483, 'heart')
(0.14185814185814186, 'old')

Checking out the words in some songs using the binary representation

Each row of the feature matrix contains ones for each word that is present in the song. We can view the words of any particular song by mapping these ones using the dictionary of words. Let's view the words in the 20th track (row of the matrix) of each genre and then look at track 250.


In [21]:
#let's look at the bag of words for three particular songs 
track_id = 20
print(track_id)
print("RAP for trackid:",[words[i] for i,r in enumerate(ra_rows[track_id]) if r==1])
print("ROCK for trackid:",[words[i] for i,r in enumerate(ro_rows[track_id]) if r==1])
print("COUNTRY for trackid:",[words[i] for i,r in enumerate(co_rows[track_id]) if r==1])

track_id = 250 
print(track_id)
print("RAP for trackid:",[words[i] for i,r in enumerate(ra_rows[track_id]) if r==1])
print("ROCK for trackid:",[words[i] for i,r in enumerate(ro_rows[track_id]) if r==1])
print("COUNTRY for trackid:",[words[i] for i,r in enumerate(co_rows[track_id]) if r==1])


20
RAP for trackid: ['ya', 'yo', 'end', 'wait', 'our', 'old']
ROCK for trackid: ['again', 'light', 'fall', 'our', 'away', 'good', 'night', 'blue', 'long', 'well']
COUNTRY for trackid: ['long', 'well', 'old']
250
RAP for trackid: ['niggaz', 'ya', 'yo', 'bitch', 'away', 'home', 'long', 'heart']
ROCK for trackid: ['fuck', 'noth', 'lie', 'gone']
COUNTRY for trackid: ['yall', 'wait', 'our', 'good', 'long', 'littl', 'well']

In [32]:
# let's look at the k most probable words for each genre based on the data we have 
k = 5
[[words[x] for x in np.argpartition(word_probs_ra, -k)[-k:]],
 [words[x] for x in np.argpartition(word_probs_ro, -k)[-k:]],
 [words[x] for x in np.argpartition(word_probs_co, -k)[-k:]]]


Out[32]:
[['bitch', 'ya', 'shit', 'yo', 'fuck'],
 ['our', 'heart', 'night', 'away', 'eye'],
 ['littl', 'long', 'well', 'heart', 'night']]

Generating random songs based on our simplified representation

Now let's generate some random songs represented as bag of words using the calculated word probabilities for each genre. This way we can understand better the assumptions and simplifications of this model. I simply generate 30 random number and then depending on the class-conditional probabilities for a particular genre if the number is great than the random number the corresponding word is selected for generation. This gives us a clear idea of what assumptions this Binomial Naive Bayes classifier makes. Running the cell multiple times show the variation we get from this very simple model.


In [37]:
print('Random rap', [w for (i,w) in enumerate(words) if np.greater(word_probs_ra, np.random.rand(30))[i]])
print('Random rock', [w for (i,w) in enumerate(words) if np.greater(word_probs_ro, np.random.rand(30))[i]])
print('Random country', [w for (i,w) in enumerate(words) if np.greater(word_probs_co, np.random.rand(30))[i]])


Random rap ['und', 'yo', 'night', 'home']
Random rock ['und', 'noth', 'gone', 'heart']
Random country ['home', 'littl', 'well', 'heart', 'old']

Using the calculated word probabilities to make a classifier

Now let's look at classifying songs using a naive Bayes Bernoulli classifier. When the representation is binary vectors indicating absense or presence of words it is called a Bernoulli Naive Bayes. If the times a word appears in a document affect the classification it is called a Multinomial text classifier.

To make a classification decision we simply calculate the likelihood for each genre independently by taking the products of the genre dependent word probabilities. The genere with the highest likelihood is selected as the predicted class. In a more realistic implementation log-likelihoods would be used to avoid problems with small numbers. Notice that when a word is absent the probability it is absent (1 - the probability it is present) is used.


In [42]:
# calcuate likelihood separately for each word 
# using naive bayes assumption and multiply 
# typically a sum of log-likelihoods is used 
# rather than a multiplication. 
def likelihood(test_song, word_probs_for_genre): 
    probability_product = 1.0 
    for (i,w) in enumerate(test_song): 
        if (w==1): 
            probability = word_probs_for_genre[i]
        else: 
            probability = 1.0 - word_probs_for_genre[i]
        probability_product *= probability 
    return probability_product

Using the trained classifier to predict

Now that we have a function to compute the likelihood given the parameters of a particular model in this case the model parameters are the probabilities for each word. We have three models to compare one for each genre. Given a test song we compute the three likelihoods and select the largest. We can randomly select a track from the country rows and then apply our predict function to see what it does. If you run the cell multiple times you will see that for most country tracks the prediction is correct but mistakes are made occassionally.


In [43]:
def predict(test_song): 
    scores = [likelihood(test_song, word_probs_ra), 
             likelihood(test_song, word_probs_ro),
             likelihood(test_song, word_probs_co)]
    labels = ['rap', 'rock', 'country']
    return labels[np.argmax(scores)]


# predict a random country track 
track_id = np.random.randint(1000)
print("Random track id", track_id)
test_song = co_rows[track_id]
print(predict(test_song))


Random track id 737
rock

Performing a simple evaluation of our classifier

We can now write a function that given a test set and associated ground truth lables runs our Bernoulli classifier and calculates the associated classification accuracy. We can now check how well the classifier does for each subset of the data corresponding to the three genres. Using the data used to trained the classifier for testing as we do here is a methodological mistake and in a more realistic scenario or application a separate dataset would be used for testing and the processing could be repeated multiple times using a scheme like k-fold cross-validation. As the purpose of this notebook is to illustrate how probabilities are used to create a Naive Bayes classifier I don't bother with that.


In [44]:
def predict_set(test_set, ground_truth_label): 
    score = 0 
    for r in test_set: 
        if predict(r) == ground_truth_label: 
            score += 1
    # convert to percentage 
    return score / 10.0 



# Let's evaluate how well our classifier does on the training set 
# A more proper evaluation would utilize cross-validation 

print("Rap accuracy% = ", predict_set(ra_rows, 'rap'))
print("Rock accuracy% = ", predict_set(ro_rows, 'rock'))
print("Country accuracy% = ", predict_set(co_rows, 'country'))


Rap accuracy% =  74.9
Rock accuracy% =  63.1
Country accuracy% =  70.9

Naive Bayes in general

This notebooks explores how a simple probabilistic model based on a binary representation for a bag of words can be used for classification. This is a toy example with a lot of assumptions and conventions that make things easier in terms of implementation. In an actual implementation the number of words in the dictionary would not be given but calculated from the data, the instances would be shuffled so to calculate the probabilities the class output field of every instance would have to be examined. The number of classes would not be fixed and loops for iterating over classes and over attributes would be written. In the computation of the likelihood a log-likelihood would be used instead and the list goes on. You can find more information about this in most textbook that describe Naive Bayes classifiers.


In [ ]: